variable selection that first estimates the regression function, yielding
a "pre-conditioned" response variable. The primary method used for
this initial regression is supervised principal components. Then we
apply a standard procedure such as forward stepwise selection or the
LASSO to the pre-conditioned response variable. In a number of simulated
and real data examples, this two-step procedure outperforms
forward stepwise selection or the usual LASSO (applied directly to
the raw outcome). We also show that under a certain Gaussian latent
variable model, application of the LASSO to the pre-conditioned
response variable is consistent as the number of predictors and
observations increases. Moreover, when the observational noise is rather
large, the suggested procedure can give a more accurate estimate than
LASSO. We illustrate our method on some real problems, including
survival analysis with microarray data.

1 Introduction 

In this paper we consider the problem of fitting linear (and other related) 
models to data for which the number of features p greatly exceeds the number 
of samples n. This problem occurs frequently in genomics, for example in 
microarray studies in which p genes are measured on n biological samples. 
The problem of model selection for data where the number of variables is
typically comparable to or much larger than the sample size has received a lot
of attention recently. In particular, various penalized regression methods
are being widely used as a means of selecting the variables having nonzero
contribution in a regression model. Among these tools the $L_1$ penalized
regression or LASSO (Tibshirani (1996)) is one of the most popular techniques.
The Least Angle Regression (LAR) procedure (Efron et al. (2004)) provides
a method for fast computation of the LASSO solution in regression problems.
Osborne et al. (2000) derived the optimality conditions associated with the
LASSO solution. Donoho & Elad (2003) and Donoho (2004) proved some
analytical properties of the $L_1$ penalization approach for determining the
sparsest solution for an under-determined linear system. Some statistical
properties of the LASSO-based estimator of the regression parameter have
been derived by Knight & Fu (2000). In the context of high-dimensional
graphs, Meinshausen & Bühlmann (2006) showed that the variable selection
method based on the LASSO can be consistent if the underlying model satisfies
some conditions. Various other model selection criteria have been proposed
in high dimensional regression problems. Fan & Li (2005) and Shen & Ye
(2002) gave surveys of some of these methods.

However, when the number of variables ($p$) is much larger than the number
of observations (precisely, $p_n \sim e^{n^{\xi}}$ for some $\xi \in (0, 1)$), Meinshausen
(2005) showed that the convergence rate of the risk of the LASSO estimator can
be quite slow. For finite-dimensional problems, Zou (2005) found a necessary
condition on the covariance matrix of the observations, without which
the LASSO variable selection approach is inconsistent. Zhao & Yu (2006)
derived a related result for the $p > n$ case.

Various modifications to LASSO have been proposed to ensure that on 
one hand, the variable selection process is consistent and on the other, the es- 
timated regression parameter has a fast rate of convergence. Fan & Li (2005) 
proposed the Smoothly Clipped Absolute Deviation (SCAD) penalty for vari- 
able selection. Fan & Peng (2004) discussed the asymptotic behavior of this 
and other related penalized likelihood procedures when the dimensionality of 
the parameter is growing. Zou (2005) proposed a non-negative Garrote-type 
penalty (that is re-weighted by the least squares estimate of the regression 
parameter) and showed that this estimator has adaptivity properties when 
p is fixed. Meinshausen (2005) proposed a relaxation to the LASSO penalty 
after initial model selection to address the problem of high bias of LASSO 
estimate when p is very large. 

All of these methods try to solve two problems at once: 1) find a good
predictor $\hat y$ and 2) find a (hopefully small) subset of variables to form the basis
for this prediction. When $p \gg n$, these problems are especially difficult. In
this paper we suggest that they should be solved separately, rather than both
at once. Moreover, the method we propose utilizes the correlation structure
of the predictors, unlike most of the methods cited. We propose a two-stage
approach:

(a) find a consistent predictor $\hat y$ of the true response,

(b) using the pre-conditioned outcome $\hat y$, apply a model fitting procedure
(such as forward stagewise selection or the LASSO) to the data $(x, \hat y)$.

In this paper we show that the use of $\hat y$ in place of $y$ in the model selection step
(b) can mitigate the effects of noisy features on the selection process under 
the setting of a latent variable model for the response, when the number of 
predictor variables that are associated with the response grows at a slower 
rate than the number of observations, even though the nominal dimension of 
the predictors can grow at a much faster rate. 

This paper is organized as follows. In Section 2 we define the pre-conditioning
method and give an example from a latent variable model. Section 3 discusses
a real example from a kidney cancer microarray study, and application of the
idea to other settings such as survival analysis. In Section 4 we give details
of the latent variable model, and show that the LASSO applied to the
pre-conditioned response yields a consistent set of predictors, as the number of
features and samples goes to infinity. Finally in Section 5 we discuss and
illustrate the pre-conditioning idea for classification problems.
2 Pre-conditioning

Suppose that the feature measurements are $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and
outcome values $y_i$, for $i = 1, 2, \ldots, n$. Our basic model has the form

$$E(y_i \mid x_i) = \theta_0 + \sum_{j=1}^{p} x_{ij}\theta_j, \quad i = 1, 2, \ldots, n. \qquad (1)$$

Two popular methods for fitting this model are forward stepwise selection
(FS) and the LASSO (Tibshirani (1996)). The first method successively enters
the variable that most reduces the residual sum of squares, while the second
minimizes the penalized criterion

$$J(\theta, \mu) = \sum_{i} \Big( y_i - \theta_0 - \sum_{j=1}^{p} \theta_j x_{ij} \Big)^2 + \mu \sum_{j=1}^{p} |\theta_j|. \qquad (2)$$

Efron et al. (2004) develop the least angle regression (LAR) algorithm, for
fast computation of the LASSO for all values of the tuning parameter $\mu > 0$.

Usually model selection in the general model (1) is quite difficult when
$p \gg n$, and our simulations confirm this. To get better results we may need
further assumptions about the underlying model relating $x_i$ to $y_i$. In this
paper, we assume that $x_i$ and $y_i$ are connected via a low-dimensional latent
variable model, and use a method that we shall refer to as pre-conditioning
to carry out model selection. In this approach, we first find a consistent
estimate $\hat y_i$ by utilizing the latent variable structure, and then apply a
fitting procedure such as forward stepwise regression or the LASSO to the
data $(x_i, \hat y_i)$, $i = 1, 2, \ldots, n$. The main technique that we consider for the
initial pre-conditioning step is supervised principal components (SPC) (Bair &
Tibshirani (2004), Bair et al. (2006)). This method works as follows:

a) we select the features whose individual correlation with the outcome is
large,

b) using just these features, we compute the principal components of the
matrix of features, giving $\hat V_1, \hat V_2, \ldots, \hat V_{\min\{N, p\}}$. The prediction $\hat y_i$ is the
least squares regression of $y_i$ on the first $K$ of these components.

Typically we use just the first or first few supervised principal components.
Bair et al. (2006) show that under an assumption about the sparsity of the
population principal components, as $p, n \to \infty$, supervised principal
components gives consistent estimates for the regression coefficients while the usual
principal components regression does not. We give details of this model in
Section 4 and provide a simple example next.
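Steps a) and b) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `spc_precondition`, the argument names, and the choice to center the data first are assumptions made here, and the correlation threshold would in practice be chosen by cross-validation or similar.

```python
import numpy as np

def spc_precondition(X, y, threshold, n_components=1):
    """Return (y_hat, keep): pre-conditioned response and indices of kept features.

    Hypothetical helper: screens features by marginal correlation (step a),
    then regresses y on the leading principal components of the kept
    features (step b).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # assumption: work with centered data
    yc = y - y.mean()
    # Step a): marginal correlation of every feature with the outcome.
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    corr = (Xc.T @ yc) / np.where(denom > 0, denom, np.inf)
    keep = np.flatnonzero(np.abs(corr) > threshold)
    if keep.size == 0:
        raise ValueError("threshold removed every feature")
    # Step b): principal components of the reduced feature matrix.
    U, s, _ = np.linalg.svd(Xc[:, keep], full_matrices=False)
    k = min(n_components, int(np.sum(s > 1e-12)))
    V = U[:, :k] * s[:k]             # principal component scores
    # Least squares regression of y on the first k components.
    coef, *_ = np.linalg.lstsq(V, yc, rcond=None)
    y_hat = y.mean() + V @ coef
    return y_hat, keep
```

The returned `y_hat` would then take the place of `y` in a subsequent forward stepwise or LASSO fit on the full feature matrix.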

2.1 Example: latent variable model 

The following example shows the main idea in this paper. Consider a model
of the form:

$$Y = \beta_0 + \beta_1 V + \sigma_1 Z. \qquad (3)$$

In addition, we have measurements on a set of features $X_j$ indexed by $j \in \mathcal{A}$,
for which

$$X_j = \alpha_{0j} + \alpha_{1j} V + \sigma_0 e_j, \quad j = 1, \ldots, p. \qquad (4)$$

The quantity $V$ is an unobserved or latent variable. The set $\mathcal{A}$ represents
the important features (meaning that $\alpha_{1j} \neq 0$ for $j \in \mathcal{A}$) for predicting $Y$.
The errors $Z$ and $e_j$ are assumed to have mean zero and are independent of
all other random variables in their respective models. All random variables
$(V, Z, e_j)$ have a standard Gaussian distribution.

2.2 Example 1 

For illustration, we generated data on $p = 500$ features and $n = 20$ samples,
according to this model, with $\beta_1 = 2$, $\beta_0 = 0$, $\alpha_{0j} = 0$, $\alpha_{1j} = 1$, $\sigma_1 = 2.5$,
$\mathcal{A} = \{1, 2, \ldots, 20\}$. Our goal is to predict $Y$ from $X_1, X_2, \ldots, X_p$, and in
the process, discover the fact that only the first 20 features are relevant.
This is a difficult problem. However if we guess (correctly) that the data
were generated from model (4), our task is made easier. The left panel
of Figure 1 shows the correlations $\mathrm{Corr}(V, X_j)$ plotted versus $\mathrm{Corr}(Y, X_j)$
for each feature $j$. The first 20 features are plotted in red, and can be
distinguished much more easily on the basis of $\mathrm{Corr}(V, X_j)$ than $\mathrm{Corr}(Y, X_j)$.
However this requires knowledge of the underlying latent factor $V$, which is
not observed.

Figure 1: Results for simulated data. Left panel shows the correlation between the
true latent variable $V$ and gene expression $X$ for each of the genes plotted against
the correlation between $Y$ and gene expression. The truly non-null genes are shown
in red. The right panel is the same, except that the estimated latent variable $\hat V$
(from supervised principal components) replaces $V$. We see that correlation with
either the true or estimated latent factor does a better job at isolating the truly
non-null genes.

The right panel shows the result when we instead estimate $V_i$ from the
data, using the first supervised principal component. We see that the
correlations of each feature with the estimated latent factor also distinguish the
relevant from the irrelevant features.
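The comparison behind Figure 1 can be imitated with a short simulation. The sketch below draws data from the latent variable model with the parameter values quoted for Example 1; the noise scale of the features (`sigma0 = 1`), the zero loading for features outside the relevant set, and the function names are assumptions made here, since the text does not state them explicitly.

```python
import numpy as np

def simulate_latent_model(n=20, p=500, n_relevant=20, beta1=2.0,
                          sigma1=2.5, sigma0=1.0, seed=0):
    """Draw (X, Y, V) from Y = beta1*V + sigma1*Z, X_j = alpha_1j*V + sigma0*e_j.

    Assumption: alpha_1j = 1 for the first n_relevant features and 0 otherwise.
    """
    rng = np.random.default_rng(seed)
    V = rng.standard_normal(n)
    alpha1 = np.zeros(p)
    alpha1[:n_relevant] = 1.0
    X = np.outer(V, alpha1) + sigma0 * rng.standard_normal((n, p))
    Y = beta1 * V + sigma1 * rng.standard_normal(n)
    return X, Y, V

def column_correlations(X, z):
    """Sample correlation of each column of X with the vector z."""
    Xc = X - X.mean(axis=0)
    zc = z - z.mean()
    return (Xc.T @ zc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(zc))

X, Y, V = simulate_latent_model()
corr_with_V = column_correlations(X, V)
corr_with_Y = column_correlations(X, Y)
print("mean |Corr(V, X_j)| over relevant features:", np.abs(corr_with_V[:20]).mean())
print("mean |Corr(Y, X_j)| over relevant features:", np.abs(corr_with_Y[:20]).mean())
```

Under these assumed settings the population value of $\mathrm{Corr}(V, X_j)$ for a relevant feature is $1/\sqrt{2}$, while $\mathrm{Corr}(Y, X_j)$ is $2/\sqrt{20.5}$, which is why the latent factor separates the relevant features more cleanly.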

Not surprisingly, this increased correlation leads to improvements in the
performance of selection methods, as shown in Table 1. We applied four
selection methods to the 20 simulated data sets from this model: FS: simple
forward stepwise regression; SPC/FS: forward stepwise regression applied to
the pre-conditioned outcome from supervised principal components; LASSO;
and SPC/LASSO: LASSO applied to the pre-conditioned outcome from
supervised principal components. The table shows the average number of good
variables selected among the first 1, 5, 10, and 20 variables selected, and
the corresponding test errors. Pre-conditioning clearly helps both forward
selection and the lasso.

2.3 Example 2. 

The second example was suggested by a referee. It is somewhat artificial but
exposes an important assumption that is made by our procedure. We define
random variables $(Y, X_1, X_2, X_3)$ having a Gaussian distribution with mean
zero and inverse covariance matrix

$$\Sigma^{-1} = \begin{pmatrix} 2 & 1 & 1 & 1 \\ 1 & 2 & 0 & 1 \\ 1 & 0 & 2 & 1 \\ 1 & 1 & 1 & 2 \end{pmatrix}.$$

We define 297 additional predictors that are $N(0, 1)$. The population
regression coefficient is $\beta = (-1, -1, -1, 0, 0, \ldots)$ while the (marginal) correlation
             Mean # of good variables          Test error
             when selecting first:             when selecting first:
Method       1      5      10     20           1        5        10       20
FS           0.82   0.98   1.12   1.58         267.36   335.4    353.52   357.07
SPC/FS       0.94   2.66   2.86   3.12         241.88   229.47   231.52   232.28
LASSO        0.88   2.05   3.17   3.29         206.54   184.56   186.71   205.85
SPC/LASSO    0.92   4.21   7.75   9.71         212.23   197.07   183.04   178.19

Table 1: Four selection methods applied to the 20 simulated data sets from the model
of Example 1. Shown are the number of good variables selected among the
first 1, 5, 10, and 20 variables selected, and the corresponding test errors.
Pre-conditioning clearly helps in both cases, and the lasso outperforms forward
selection.






             Mean # of good variables
             when selecting first:
Method       1      2      3      4
LASSO        1.0    2.0    3.0    3.0
SPC/LASSO    1.0    2.0    2.0    2.0

Table 2: Performance of LASSO and pre-conditioned LASSO in the second
simulation example.

of each predictor with $Y$ is $\rho = (-0.5, -0.5, 0, 0, 0, \ldots)$. Hence $X_3$ has zero
marginal correlation with $Y$ but has a non-zero partial correlation with $Y$
(since $(\Sigma^{-1})_{14} = 1$). The number of good variables when selecting the first
1, 2, 3 or 4 predictors is shown in Table 2.

We see that the LASSO enters the 3 good predictors first in every simulation,
while the pre-conditioned version ignores the 3rd predictor. Supervised
principal components screens out this predictor, because it is marginally
independent of $Y$.

Pre-conditioning with supervised principal components assumes that any 
important predictor (in the sense of having significantly large nonzero re- 
gression coefficient) will also have a substantial marginal correlation with 
the outcome. This need not be true in practice, but we believe it will often 
be a good working hypothesis in many practical problems. 
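The claim that $X_3$ is marginally uncorrelated with $Y$ yet conditionally dependent on it can be checked numerically. The sketch below inverts the $4 \times 4$ inverse covariance matrix of Example 2 as read from the display, with rows ordered $(Y, X_1, X_2, X_3)$; the zero entries in positions (2,3) and (3,2) are inferred from symmetry, so treat the matrix as an assumption about the intended example.

```python
import numpy as np

# Inverse covariance of (Y, X1, X2, X3); entries (2,3)/(3,2) assumed to be 0.
precision = np.array([[2.0, 1.0, 1.0, 1.0],
                      [1.0, 2.0, 0.0, 1.0],
                      [1.0, 0.0, 2.0, 1.0],
                      [1.0, 1.0, 1.0, 2.0]])

covariance = np.linalg.inv(precision)
sd = np.sqrt(np.diag(covariance))

# Marginal correlation of Y with X1, X2, X3.
marginal_corr = covariance[0, 1:] / (sd[0] * sd[1:])

# Partial correlation of Y with each X_j given the others:
# -Omega_{1j} / sqrt(Omega_{11} * Omega_{jj}).
partial_corr = -precision[0, 1:] / np.sqrt(precision[0, 0] * np.diag(precision)[1:])

print("marginal:", np.round(marginal_corr, 6))
print("partial: ", np.round(partial_corr, 6))
```

The marginal correlations come out as $(-0.5, -0.5, 0)$, matching $\rho$ in the text, while all three partial correlations are non-zero. A screening step based on marginal correlation therefore cannot see $X_3$.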
             Mean # of good variables
             when selecting first:
Method       5      10     20     50
LASSO        2.92   5.88   9.04   9.16
SPC/LASSO    2.49   5.13   10.32  19.73

Table 3: Performance of LASSO and pre-conditioned LASSO in the third
simulation example.

2.4 Example 3. 

Our third simulation study compares the lasso to the pre-conditioned lasso, in 
a more neutral setting. We generated 1000 predictors, each having a N(0, 1) 
distribution marginally. The first 40 predictors had a pairwise correlation of 
0.5, while the remainder were uncorrelated. 
The outcome was generated as

$$Y = \sum_{j=1}^{40} \beta_j X_j + \sigma Z \qquad (5)$$

with $Z, \beta_j \sim N(0, 1)$ and $\sigma = 5$. Hence the outcome is only a function of the
first 40 ("good") predictors.

We generated 100 datasets from this model: the average number of good
variables selected by the lasso and pre-conditioned lasso is shown in Table
3. Note that with just $n = 50$ samples, the maximum number of predictors
in the model is also 50. While neither method is successful at isolating the
bulk of the 40 good predictors, the pre-conditioned lasso finds twice as many
good predictors as the lasso in the full model.
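One way to generate data with the structure of Example 3 is sketched below. The text does not say how the equicorrelated block was produced; using a single shared Gaussian factor is an assumption made here, as are the function and argument names.

```python
import numpy as np

def simulate_example3(n=50, p=1000, n_good=40, rho=0.5, sigma=5.0, seed=0):
    """Predictors are N(0,1) marginally; the first n_good have pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    shared = rng.standard_normal((n, 1))
    # A common factor with weight sqrt(rho) gives correlation rho between any
    # two of the first n_good columns while keeping unit variance.
    X[:, :n_good] = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * X[:, :n_good]
    beta = rng.standard_normal(n_good)            # beta_j ~ N(0, 1)
    y = X[:, :n_good] @ beta + sigma * rng.standard_normal(n)
    return X, y, beta

X, y, beta = simulate_example3()
print(X.shape, y.shape, beta.shape)
```

Each call with a different `seed` yields one dataset; repeating it 100 times would mirror the simulation design described above.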

3 Examples 

3.1 Kidney cancer data 

Zhao et al. (2005) collected gene expression data on 14,814 genes from 177 
kidney patients. Survival times (possibly censored) were also measured for 
each patient, as well as a number of clinical predictors including the grade 
of the tumor: 1 (good) to 4 (poor). 

The data were split into 88 samples to form the training set and the 
remaining 89 formed the test set. For illustration, in this section we try to 
predict grade from gene expression. In the next section we predict survival 
time (the primary outcome of interest) from gene expression. Figure 2 shows
the training and test set correlations between grade and its prediction from
different methods. We see that for both forward selection and the LASSO, use
of the supervised principal component prediction $\hat y$ as the outcome variable
(instead of $y$ itself) makes the procedure less greedy in the training set and
yields higher correlations in the test set. While the correlations in the test
set are not spectacularly high, for SPC/FS and SPC/LASSO they do result
in better predictions in the test set.

3.2 Application to other regression settings 

Extension of our proposal to other kinds of regression outcomes is very simple.
The only change is in step (a) of the supervised principal components algorithm,
where we replace the correlation by an appropriate measure of association.
In particular, the likelihood score statistic is an attractive choice.

3.3 Survival analysis 

Perhaps the most common version of the $p > n$ regression problem in genomic
studies is survival analysis, where the outcome is patient survival (possibly
censored). Then we use the partial likelihood score statistic from Cox's
proportional hazards model (see Chapter 4 of Kalbfleisch & Prentice
(1980)) in step (a) of supervised principal components. After that, we can
(conveniently) use the usual least squares version of FS or LASSO in step (b)
of the modeling process. Hence the computational advantages of the least
angle regression algorithm can be exploited.
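For screening in step (a), a univariate Cox score statistic can be computed for each feature. The sketch below uses the standard score test of the partial likelihood at a zero coefficient, with tied event times handled in the Breslow manner. The paper does not spell out the formula, so this particular form, and the function name `cox_scores`, should be read as assumptions.

```python
import numpy as np

def cox_scores(X, time, event):
    """Cox partial-likelihood score statistic (at beta = 0) for each column of X.

    time  : observed follow-up times
    event : 1 if the time is an observed event, 0 if censored
    """
    X = np.asarray(X, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    score = np.zeros(X.shape[1])
    info = np.zeros(X.shape[1])
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]                 # risk set at this event time
        score += X[i] - X[at_risk].mean(axis=0)   # observed minus expected covariate
        info += X[at_risk].var(axis=0)            # variance of covariate in risk set
    return score / np.sqrt(np.where(info > 0, info, np.inf))

# Small worked example: larger x goes with earlier failure, so the score is positive.
x = np.array([3.0, 2.0, 1.0, 0.0])
print(cox_scores(np.column_stack([x, -x]), time=[1, 2, 3, 4], event=[1, 1, 1, 1]))
```

Features whose absolute score exceeds a threshold would be passed to the principal components step, exactly as the correlation is used in the regression case.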

Figure 3 shows the result of applying forward stepwise Cox regression (top
left panel), forward stepwise selection applied to the SPC predictor (top right
panel), LASSO for the Cox model (bottom left panel) and LASSO applied to
the SPC predictor (bottom right panel). The bottom left panel was computed
using the glmpath R package of Park & Hastie (2006), available in the CRAN
collection. In each case we obtain a predictor $\hat y$, and then use $\hat y$ as a covariate
in a Cox model, in either the training or test set. The resulting p-values from
these Cox models are shown in the figure. We see that forward stepwise Cox
regression tends to overfit in the training set, and hence the resulting
test-set p-values are not significant. The two-stage SPC/FS procedure fits more
slowly in the training set, and hence achieves smaller p-values in the test set.
"SPC/LASSO", the LASSO applied to the pre-conditioned response from
supervised principal components, performs best and is also computationally
convenient: it uses the fast LAR algorithm for the lasso, applied to the
pre-conditioned response variable.

Figure 2: Kidney cancer data: predicting tumor grade. Correlation of different
predictors with the true outcome, in the training and test sets, as more and more
genes are entered.

The horizontal green line shows the test set p-value of the supervised 
principal component predictor. We see that the first 10 or 15 genes chosen 
by the LASSO have captured the signal in this predictor. 

We have used the pre-conditioning procedure in real microarray studies.
We have found that it is useful to report to investigators not just the best 10
or 15 gene model, but also any genes that have high correlation with this set.
The enlarged set can be useful in understanding the underlying biology in
the experiment, and also for building assays for future clinical use. A given gene
might not be well measured on a microarray for a variety of reasons, and
hence it is useful to identify surrogate genes that may be used in its place.
Figure 4 (left panel) shows the average absolute Cox score of the first $k$ features
entered by forward stepwise selection (red) and the pre-conditioned version
(green), as $k$ runs from 1 to 30. The right panel shows the average absolute
pairwise correlation of the genes for both methods. We see that the methods
enter features of about the same strength, but pre-conditioning enters genes
that are more highly correlated with one another.

4 Asymptotic analysis 

In this section we lay down a mathematical formulation of the problem and
pre-conditioning procedure in the context of a latent factor model for the
response. We show that the procedure combining SPC with LASSO, under
some assumptions about the correlation structure among the variables,
leads to asymptotically consistent variable selection in the Gaussian linear
model setting. We consider the class of problems where one observes $n$
independent samples $(y_i, x_i)$ where $y_i$ is a one dimensional response and $x_i$ is a
$p$-dimensional predictor. Individual coordinates of the vector $x_i$ are denoted
by $x_{ij}$ where the index $j \in \{1, \ldots, p\}$ corresponds to the $j$-th predictor. We
denote the $n \times p$ matrix $((x_{ij}))_{1 \leq i \leq n,\, 1 \leq j \leq p}$ by $X$ and the vector $(y_i)_{i=1}^{n}$ by $Y$.
Henceforth, unless otherwise stated, we do not make a distinction between
the realized value $(Y, X)$ and the random elements (namely, the response and
the $p$ predictors) that they represent.
Figure 3: Kidney cancer data: predicting survival time. Training set p-values
(red) and test set p-values (green) for four different selection methods as more and
more genes are entered. Horizontal broken lines are drawn at 0.05 (black) and the
test set p-value for the supervised principal component predictor 0.00042 (green).

Figure 4: Kidney cancer data: predicting survival time. Left panel shows the
average absolute Cox score of the first k genes entered by forward stepwise selection
(red) and the pre-conditioned version (green), as k runs from 1 to 30. The right
panel shows the average absolute pairwise correlation of the genes for both methods.

The interest is in identifying the set of predictors $X_j$ which are (linearly)
related to $Y$. A regression model will be of the form $E(Y \mid x) = \theta^T x$ for some
$\theta \in \mathbb{R}^p$. Here we assume that the joint distribution of $X$ is Gaussian with
zero mean and covariance matrix $\Sigma = \Sigma_p$. The relationship between $Y$ and
$X$ is assumed to be specified by a latent component model to be described
below.

4.1 Model for X

Suppose that the spectral decomposition of $\Sigma$ is given by $\Sigma = \sum_{k=1}^{p} \ell_k u_k u_k^T$,
where $\ell_1 \geq \cdots \geq \ell_p \geq 0$ and $u_1, \ldots, u_p$ form an orthonormal basis of $\mathbb{R}^p$. We
consider the following model for $\Sigma$.

Assume that there exists an $M \geq 1$ such that

$$\ell_k = \lambda_k + \sigma_0^2, \quad k = 1, \ldots, M, \quad \text{and} \quad \ell_k = \sigma_0^2, \quad k = M + 1, \ldots, p, \qquad (6)$$

where $\lambda_1 \geq \cdots \geq \lambda_M > 0$ and $\sigma_0 > 0$. This model will be referred to as the
"noisy factor model". To see this, notice that under the Gaussian assumption
the matrix $X$ can be expressed as

$$X = \sum_{k=1}^{M} \sqrt{\lambda_k}\, v_k u_k^T + \sigma_0 E, \qquad (7)$$

where $v_1, \ldots, v_M$ are i.i.d. $N_n(0, I)$ vectors (the factors), and $E$ is an $n \times p$
matrix with i.i.d. $N(0, 1)$ entries, and is independent of $v_1, \ldots, v_M$. This
matrix is viewed as a noise matrix.
In the analysis presented in this paper throughout we use (7) as the
model for $X$, even though it can be shown that the analysis applies even
in the case where $\ell_{K+1}, \ldots, \ell_p$ are decreasing and sufficiently well separated
from $\ell_1, \ldots, \ell_K$.
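To make the noisy factor model concrete, the sketch below draws a data matrix as a sum of $M$ rank-one factor terms, scaled by $\sqrt{\lambda_k}$, plus isotropic noise with scale $\sigma_0$, and checks that the sample covariance has leading eigenvalues near $\lambda_k + \sigma_0^2$ and remaining eigenvalues near $\sigma_0^2$. Drawing the orthonormal vectors $u_k$ from a QR decomposition of a random matrix is an arbitrary choice made here for illustration, as are the function and variable names.

```python
import numpy as np

def simulate_noisy_factor(n, p, lambdas, sigma0, seed=0):
    """Draw X = sum_k sqrt(lambda_k) v_k u_k^T + sigma0 * E."""
    rng = np.random.default_rng(seed)
    M = len(lambdas)
    # Orthonormal loading vectors u_1..u_M as the columns of U (illustrative choice).
    U, _ = np.linalg.qr(rng.standard_normal((p, M)))
    factors = rng.standard_normal((n, M))      # columns are v_1..v_M
    noise = rng.standard_normal((n, p))        # the noise matrix E
    X = (factors * np.sqrt(lambdas)) @ U.T + sigma0 * noise
    return X, U, factors

X, U, factors = simulate_noisy_factor(n=20000, p=10, lambdas=[5.0, 2.0], sigma0=1.0)
eigenvalues = np.sort(np.linalg.eigvalsh(X.T @ X / X.shape[0]))[::-1]
print(np.round(eigenvalues, 2))   # roughly 6, 3, then values close to 1
```

With $n$ much larger than $p$, as in this check, the sample eigenvalues sit close to their population values; the asymptotic analysis in this section concerns the harder regime where $p$ grows with $n$.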



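4.2 Model for Y

Assume the following regression model for $Y$. Note that this is a more general
version of (3), even though we assume that $Y$ has (unconditional) mean 0.

$$Y = \sum_{k=1}^{K} \beta_k v_k + \sigma_1 Z, \qquad (8)$$

where $\sigma_1 > 0$, $1 \leq K \leq M$, and $Z$ has $N_n(0, I)$ distribution and is
independent of $X$.

4.3 Least squares and feature selection

We derive expressions for the marginal correlations between $Y$ and $X_j$,
for $j = 1, \ldots, p$ and the (population) least squares solution, viz.
$\theta := \arg\min_{\zeta} E \| Y - X\zeta \|_2^2$, in terms of the model parameters. Let
$\mathcal{P} := \{1, \ldots, p\}$. The marginal correlation between $X = (X_j)_{j=1}^{p}$ and $Y$ is given by

$$\Sigma_{\mathcal{P}y} := (E(X_j Y))_{j=1}^{p} = \sum_{k=1}^{K} \beta_k \sqrt{\lambda_k}\, u_k. \qquad (9)$$

The population regression coefficient of $Y$ on $X$ is given by

$$\theta = \Sigma^{-1} \Sigma_{\mathcal{P}y}
= \Big[ \sum_{k=1}^{M} \frac{1}{\ell_k} u_k u_k^T + \frac{1}{\sigma_0^2} \Big( I - \sum_{k=1}^{M} u_k u_k^T \Big) \Big] \sum_{k=1}^{K} \beta_k \sqrt{\lambda_k}\, u_k
= \sum_{k=1}^{K} \beta_k \frac{\sqrt{\lambda_k}}{\ell_k} u_k
= \sum_{k=1}^{K} \beta_k \frac{\sqrt{\lambda_k}}{\lambda_k + \sigma_0^2} u_k. \qquad (10)$$

Now, define $w_j = (\sqrt{\lambda_1}\, u_{j1}, \ldots, \sqrt{\lambda_K}\, u_{jK})^T$. Let $\mathcal{D} = \{ j : \| w_j \|_2 \neq 0 \}$.
Observe that $\Sigma_{jy} = \beta^T w_j$, and $\theta_j = \beta^T D_K^{-1} w_j$, where $D_K = \mathrm{diag}(\ell_1, \ldots, \ell_K)$.
So if we define $\mathcal{B} := \{ j : \Sigma_{jy} \neq 0 \}$, and $\mathcal{A} := \{ j : \theta_j \neq 0 \}$, then $\mathcal{B} \subset \mathcal{D}$ and
$\mathcal{A} \subset \mathcal{D}$.

This gives rise to the regression model:

$$Y = X\theta + \sigma_\varepsilon \varepsilon, \qquad (11)$$

where

$$\sigma_\varepsilon^2 = \sigma_1^2 + \sum_{k=1}^{K} \beta_k^2 - \sum_{k=1}^{K} \beta_k^2 \frac{\lambda_k}{\lambda_k + \sigma_0^2} = \sigma_1^2 + \sigma_0^2\, \beta^T D_K^{-1} \beta, \qquad (12)$$

and $\varepsilon$ has i.i.d. $N(0, 1)$ entries and is independent of $X$.

Note also that the population partial covariance between $Y$ and $X_{\mathcal{C}}$
given $X_{\mathcal{D}}$ (given by $\Sigma_{y\mathcal{C}|\mathcal{D}} := \Sigma_{y\mathcal{C}} - \Sigma_{y\mathcal{D}} \Sigma_{\mathcal{D}\mathcal{D}}^{-1} \Sigma_{\mathcal{D}\mathcal{C}}$), for any subset $\mathcal{C} \subset \mathcal{D}^c$,
where $\mathcal{D}^c := \mathcal{P} \setminus \mathcal{D}$, is 0. However the corresponding statement is not true
in general if one replaces $\mathcal{D}$ by either $\mathcal{A}$ or $\mathcal{B}$. Therefore, ideally, one would
like to identify $\mathcal{D}$. However, it may not be possible to accomplish this in
general when the dimension $p$ grows with the sample size $n$. Rather, we
define the feature selection problem as the problem of identifying $\mathcal{A}$, while
the estimation problem is to obtain an estimate of $\theta$ from model (11).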

Observe that, if either $K = 1$ or $\lambda_1 = \cdots = \lambda_K$, then $\mathcal{A} = \mathcal{B}$. In the
former case we actually have $\mathcal{A} = \mathcal{B} = \mathcal{D}$. In these special cases, the feature
selection problem reduces to finding the set $\mathcal{B}$, which may be done (under
suitable identifiability conditions) just by computing the sample marginal
correlations between the response and the predictors and selecting those
variables (coordinates) for which the marginal correlation exceeds an appropriate
threshold. The major assumptions that we shall make here for solving the
problem are that (i) $\mathcal{A} \subset \mathcal{B}$, (ii) $\mathcal{B}$ can be identified from the data (at least
asymptotically), (iii) the cardinality of $\mathcal{B}$ (and hence that of $\mathcal{A}$) is small
compared to $n$, and (iv) the contribution of the coordinates $\mathcal{B}^c$ in the vectors
$u_1, \ldots, u_K$ is asymptotically negligible in an $L_2$ sense. If these conditions
are satisfied, then it will allow for the identification of $\mathcal{A}$, even as dimension
increases with the sample size. We make these (and other) conditions more
precise in Section 4.7.
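The population quantities of this subsection can be checked numerically. Under the noisy factor model the covariance is $\Sigma = \sum_{k \le M} \lambda_k u_k u_k^T + \sigma_0^2 I$, and if $Y$ loads on the first $K$ factors with weights $\beta_k$ plus independent noise of variance $\sigma_1^2$, the cross-covariance vector is $\sum_{k \le K} \beta_k \sqrt{\lambda_k} u_k$. The sketch below, with arbitrarily chosen parameter values that are purely illustrative, compares the direct solution of the normal equations with the closed form $\sum_{k \le K} \beta_k \sqrt{\lambda_k}\, u_k / (\lambda_k + \sigma_0^2)$ and does the same for the residual variance.

```python
import numpy as np

rng = np.random.default_rng(1)
p, M, K = 12, 3, 2                       # illustrative sizes (assumed)
lambdas = np.array([4.0, 2.0, 1.0])
sigma0, sigma1 = 0.7, 1.5
beta = np.array([1.0, -2.0])

U, _ = np.linalg.qr(rng.standard_normal((p, M)))      # orthonormal u_1..u_M
Sigma = (U * lambdas) @ U.T + sigma0 ** 2 * np.eye(p)
ell = lambdas + sigma0 ** 2                            # leading eigenvalues of Sigma

# Cross-covariance between the predictors and Y.
Sigma_Py = U[:, :K] @ (beta * np.sqrt(lambdas[:K]))

# Population regression coefficient: direct solve versus closed form.
theta_solve = np.linalg.solve(Sigma, Sigma_Py)
theta_closed = U[:, :K] @ (beta * np.sqrt(lambdas[:K]) / ell[:K])

# Residual variance: Var(Y) minus explained variance, versus closed form.
resid_var = sigma1 ** 2 + beta @ beta - theta_solve @ Sigma_Py
resid_closed = sigma1 ** 2 + sigma0 ** 2 * np.sum(beta ** 2 / ell[:K])

print(np.max(np.abs(theta_solve - theta_closed)), resid_var - resid_closed)
```

Both differences are zero up to rounding, which also makes visible why the residual variance exceeds $\sigma_1^2$: the factors are observed only through noisy features.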
4.4 SPC as a preconditioner

The formulation in the previous section indicates that one may use some
penalized regression methods to estimate the regression parameter $\theta$ from the
model (11). However, standard methods like LASSO do not use the covariance
structure of the data. Therefore if one uses the underlying structure for
$\Sigma$, and has good estimates of the parameters $(u_k, \ell_k)$, then one can hope to
be able to obtain a better estimate $\hat\theta$, as well as identify $\mathcal{A}$ as $n \to \infty$.

We focus on (7) and (8). In general it is not possible to eliminate the
contribution of $E$ entirely from an estimate of $v_k$ even if we had perfect
knowledge of $(u_k, \ell_k)$. To understand this, note that the conditional
distribution of $v_k$ given $X$ is the same as the conditional distribution of $v_k$ given
$X u_k$. The latter distribution is normal with mean $\frac{\sqrt{\lambda_k}}{\ell_k} X u_k$ and covariance
matrix $\frac{\sigma_0^2}{\ell_k} I_n$. This means that any reasonable procedure that estimates the
parameters $(u_k, \ell_k)$ can only hope to reduce the effect of the measurement
noise in $Y$, viz. $\sigma_1 Z$.

Keeping these considerations in mind, we employ a two stage procedure
described in the following section for estimating $\theta$. In order to fit the model
(11) using the SPC procedure, it is necessary to estimate the eigenvectors $u_k$,
$k = 1, \ldots, M$. When $p$ is large (in the sense that the fraction $p/n$ does not
converge to 0 as $n \to \infty$), in general it is not possible to estimate $u_k$ consistently.
However, if the $u_k$ are sparse, in the sense of having say $q$ non-zero components,
where $q/n \to 0$, then Bair et al. (2006) showed that under suitable
identifiability conditions, it is possible to get asymptotically consistent estimators of
$u_1, \ldots, u_K$, where the consistency is measured in terms of convergence of the
$L_2$ distance between the parameter and its estimator.

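4.5 Algorithm

In this section we present the algorithm in detail.

Step 1 Estimate $(u_1, \ell_1), \ldots, (u_K, \ell_K)$ by the SPC procedure in which only those
predictors $X_j$ whose empirical correlation with the response $Y$ is above
a threshold $\tau_n$ are used in the eigen-analysis. Call these estimates
$\{(\hat u_k, \hat\ell_k)\}_{k=1}^{K}$.

Step 2 Let $\hat P_K := \mathrm{Proj}(\hat V_1, \ldots, \hat V_K)$ be the projection onto $\hat V_1, \ldots, \hat V_K$, where
$\hat V_k := \frac{1}{\sqrt{\hat\ell_k}} X \hat u_k$ is the $k$-th principal component of the predictors (under
the SPC procedure). Define $\hat Y = \hat P_K Y$.

Step 3 Estimate $\theta$ from the linear model $\hat Y = X\theta + \text{error}$, using the LASSO
approach with penalty $\mu_n > 0$.

Since by definition $\frac{1}{n} \langle X \hat u_k, X \hat u_{k'} \rangle = \hat\ell_k \delta_{kk'}$, it follows that

$$\hat P_K = \mathrm{Proj}(X \hat u_1, \ldots, X \hat u_K) = \sum_{k=1}^{K} \frac{1}{\| X \hat u_k \|^2} (X \hat u_k)(X \hat u_k)^T = \sum_{k=1}^{K} \frac{1}{\hat\ell_k} \frac{1}{n} (X \hat u_k)(X \hat u_k)^T. \qquad (13)$$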
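Steps 1 and 2 of the algorithm amount to screening by correlation, an eigen-decomposition of the sample covariance of the retained predictors, and a projection of the response onto the leading principal component scores. The following NumPy sketch illustrates this under assumptions made here: the data are centered first, the eigenvectors are those of the retained columns only, and the name `spc_projection` and its arguments are hypothetical.

```python
import numpy as np

def spc_projection(X, y, tau, K):
    """Return (P, y_proj, keep): projection matrix, projected response, kept columns."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                 # assumption: centered predictors
    yc = y - y.mean()
    # Step 1: keep predictors whose empirical correlation with y exceeds tau.
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    keep = np.flatnonzero(np.abs(corr) > tau)
    # Eigen-analysis of the sample covariance of the retained predictors.
    S = Xc[:, keep].T @ Xc[:, keep] / n
    ell_hat, U_hat = np.linalg.eigh(S)      # eigenvalues in ascending order
    order = np.argsort(ell_hat)[::-1][:K]
    ell_hat, U_hat = ell_hat[order], U_hat[:, order]
    # Step 2: P = sum_k (X u_k)(X u_k)^T / (n * ell_k).
    scores = Xc[:, keep] @ U_hat            # columns are X u_hat_k
    P = (scores / ell_hat) @ scores.T / n
    return P, P @ yc, keep
```

Because the score vectors are mutually orthogonal with squared norm $n \hat\ell_k$, the matrix `P` is symmetric and idempotent with trace $K$, which gives a quick sanity check. The projected response would then be passed to a LASSO solver in Step 3.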
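4.6 Analysis of the projection

We present an expansion of the projected response $\hat Y := \hat P_K Y$ that will be
useful for all the asymptotic analyses that follow. Using the representation
of $\hat P_K$ in (13) and invoking (7) and (8), we get

$$
\begin{aligned}
\hat Y &= \sum_{k=1}^{K} \frac{1}{\hat\ell_k} \frac{1}{n} \langle X \hat u_k, Y \rangle X \hat u_k \\
&= \sum_{k=1}^{K} \sum_{k'=1}^{K} \frac{\beta_{k'}}{\hat\ell_k} \frac{1}{n} \langle X \hat u_k, v_{k'} \rangle X \hat u_k
 + \sigma_1 \sum_{k=1}^{K} \frac{1}{\hat\ell_k} \frac{1}{n} \langle X \hat u_k, Z \rangle X \hat u_k \\
&= \sum_{k=1}^{K} \beta_k \frac{\sqrt{\lambda_k}}{\hat\ell_k} \frac{\| v_k \|^2}{n} \langle \hat u_k, u_k \rangle X \hat u_k
 + \sum_{k=1}^{K} \sum_{k' \neq k} \beta_{k'} \frac{\sqrt{\lambda_{k'}}}{\hat\ell_k} \frac{\| v_{k'} \|^2}{n} \langle \hat u_k, u_{k'} \rangle X \hat u_k \\
&\quad + \sigma_0 \sum_{k=1}^{K} \sum_{k'=1}^{K} \frac{\beta_{k'}}{\hat\ell_k} \frac{1}{n} \langle E \hat u_k, v_{k'} \rangle X \hat u_k
 + \sigma_1 \sum_{k=1}^{K} \frac{1}{\hat\ell_k} \frac{1}{n} \langle X \hat u_k, Z \rangle X \hat u_k + R_n \\
&= X\theta + X \sum_{k=1}^{K} \beta_k \sqrt{\lambda_k} \Big( \frac{1}{\hat\ell_k} \frac{\| v_k \|^2}{n} \langle \hat u_k, u_k \rangle \hat u_k - \frac{1}{\ell_k} u_k \Big)
 + \sum_{k=1}^{K} \sum_{k' \neq k} \beta_{k'} \frac{\sqrt{\lambda_{k'}}}{\hat\ell_k} \frac{\| v_{k'} \|^2}{n} \langle \hat u_k, u_{k'} \rangle X \hat u_k \\
&\quad + \sigma_0 \sum_{k=1}^{K} \sum_{k'=1}^{K} \frac{\beta_{k'}}{\hat\ell_k} \frac{1}{n} \langle E \hat u_k, v_{k'} \rangle X \hat u_k
 + \sigma_1 \sum_{k=1}^{K} \frac{1}{\hat\ell_k} \frac{1}{n} \langle X \hat u_k, Z \rangle X \hat u_k + R_n \qquad (14)
\end{aligned}
$$

for some vector $R_n \in \mathbb{R}^n$. This is an asymptotically unbiased regression
model for estimating $\theta$ provided $(\hat u_k, \hat\ell_k)_{k=1}^{K}$ is an asymptotically consistent
estimator for $(u_k, \ell_k)_{k=1}^{K}$.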
4.7 Assumptions

In this section we give sufficient conditions for the consistency of the variable
selection aspect of the SPC preconditioning procedure. The methods of Zou
(2005) and Knight & Fu (2000) are not applicable in our situation since
the dimension is growing with the sample size. For the most part, we make
assumptions similar to those in Meinshausen & Bühlmann (2006) for the
relationship among the variables.

A1 The eigenvalues $\lambda_1, \ldots, \lambda_M$ satisfy

(i) $\lambda_1 > \cdots > \lambda_K > \lambda_{K+1} \geq \cdots \geq \lambda_M > 0$.

(ii) $\min_{1 \leq k \leq K} (\lambda_k - \lambda_{k+1}) \geq C$ for some $C > 0$ (fixed).

(iii) $\lambda_1 \leq \Lambda_{\max}$ for some $\Lambda_{\max}$ fixed. Also, $\sigma_0$ is fixed.

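A2 $\sigma_1^2 = O(n^{\kappa_0})$ for some $\kappa_0 \in (0, \frac{1}{2})$.

A3 $|\mathcal{A}| = q_n$, $|\mathcal{B}| = \bar q_n$ such that $\bar q_n = O(n^{\kappa_1})$ for some $\kappa_1 \in (0, \frac{1}{2})$.

A3' $p_n$, the number of variables, satisfies the condition that there is an
$\alpha > 0$ such that $\log p_n = O(n^{\alpha})$ for some $\alpha \in (0, 1)$.

A4 There exists a $\rho_n$ satisfying $\rho_n n^{1/2} (\log p_n)^{-1/2} \to \infty$ as $n \to \infty$ such
that

$$\min_{j \in \mathcal{B}} \left| \frac{\Sigma_{jy}}{\sqrt{\Sigma_{jj}}} \right| \geq \rho_n. \qquad (15)$$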
A5 There exists a 8 n with 5 n = o( n ^ ) such that II w j 111^ d n .

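A6 There exists an $\eta_n > 0$ satisfying $\eta_n^{-1} = O(n^{\kappa_2})$ for some $\kappa_2 < \frac{1}{2}(1 - \kappa_0 \vee \kappa_1)$,
such that

$$\min_{j \in \mathcal{A}} |\theta_j| \geq \eta_n. \qquad (16)$$

A7 There exists a $\delta \in (0, 1)$ such that

$$\| \Sigma_{\mathcal{A}^c \mathcal{A}} \Sigma_{\mathcal{A}\mathcal{A}}^{-1} \mathrm{sign}(\theta_{\mathcal{A}}) \|_\infty < \delta. \qquad (17)$$

A8 There is a $\vartheta < \infty$ such that

$$\max_{j \in \mathcal{A}} \| \Sigma_{\mathcal{A}_j \mathcal{A}_j}^{-1} \Sigma_{\mathcal{A}_j j} \|_1 < \vartheta, \quad \text{where } \mathcal{A}_j := \mathcal{A} \setminus \{j\}. \qquad (18)$$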

A few remarks about these conditions are in order. First, condition A1
about the separation of the eigenvalues is not really necessary, but is assumed
to avoid the issue of un-identifiability of an eigenvector. However, the scaling
of the eigenvalues is important for the analysis. We remark that it is not
necessary that the eigenvalues $\lambda_1, \ldots, \lambda_M$ are the $M$ largest eigenvalues of $\Sigma$
in order for the conclusions to hold. All that is necessary is that these are the
leading eigenvalues of the matrix $\Sigma_{\mathcal{D}\mathcal{D}}$, and there is enough separation from
the other eigenvalues of $\Sigma$. However, this assumption is made to simplify the
exposition.

Next, the condition that $\bar{q}_n = o(n)$ (implicit from condition A3) is
necessary for the consistency of the estimated eigenvectors from Supervised
PCA. Condition A4 is necessary for the identifiability of the set B. A5
implies that the contribution of the predictors $\{X_j : j \in \mathcal{P} \setminus \mathcal{D}\}$ is negligible
in our analysis. Note that $\delta_n$ is essentially measuring the "selection bias"
for restricting analysis to B rather than $\mathcal{P}$. Again, the assumption about
the rate of decay of $\delta_n$ can be relaxed at the cost of more involved analysis
and smaller range of values for $\mu_n$ (see also the remark following Corollary
1). Too large a value of $\delta_n$ may mean that we may not be able to select the
variables consistently. Condition A6 is an identifiability condition for set A.

Condition A7 is needed to guarantee consistency of the variable selection
by LASSO after projection. This condition was shown to be necessary
for variable selection in finite dimensional LASSO regression by Zou
(2005) and also, implicitly, by Meinshausen & Bühlmann (2006). Zhao &
Yu (2006) termed this the "irrepresentable condition" and showed that it is
nearly necessary and sufficient for consistency of model selection by LASSO
when $p, n \to \infty$. A sufficient condition for this to hold is that
$\max_{j \in A^c} \| \Sigma_{AA}^{-1} \Sigma_{Aj} \|_1 < \delta$. Observe that $\Sigma_{AA}^{-1} \Sigma_{Aj}$ is the population
regression coefficient in the regression of $X_j$ on $\{X_l : l \in A\}$. If we are
using the estimate $\hat\theta^{\hat B, \mu}$, then (see proof of Lemma 2) we can replace A7
by the weaker requirement

$$\| \Sigma_{A^c \cap B, A} \Sigma_{AA}^{-1} \operatorname{sign}(\theta_A) \|_\infty < \delta, \quad \text{for some } \delta \in (0, 1).$$
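
As an illustration of how condition A7 and its sign-free sufficient version
can be checked for a given covariance matrix, the following is a minimal
NumPy sketch. The function names, the equicorrelated covariance matrix and
the values of `A` and `theta_A` are hypothetical choices made only for this
example; they are not taken from the paper.

```python
import numpy as np

def irrepresentable_norm(Sigma, A, theta_A):
    """Left side of A7: || Sigma_{A^c A} Sigma_{AA}^{-1} sign(theta_A) ||_inf."""
    p = Sigma.shape[0]
    A = np.asarray(A)
    Ac = np.setdiff1d(np.arange(p), A)
    # Solve Sigma_AA w = sign(theta_A) rather than forming the inverse.
    w = np.linalg.solve(Sigma[np.ix_(A, A)], np.sign(theta_A))
    return np.max(np.abs(Sigma[np.ix_(Ac, A)] @ w))

def sufficient_bound(Sigma, A):
    """Sign-free bound: max over j in A^c of || Sigma_{AA}^{-1} Sigma_{Aj} ||_1."""
    p = Sigma.shape[0]
    A = np.asarray(A)
    Ac = np.setdiff1d(np.arange(p), A)
    # Column j holds the population coefficients of X_j regressed on X_A.
    coef = np.linalg.solve(Sigma[np.ix_(A, A)], Sigma[np.ix_(A, Ac)])
    return np.max(np.sum(np.abs(coef), axis=0))

# Hypothetical example: equicorrelated covariance with correlation rho.
p, rho = 6, 0.3
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
A = [0, 1]
theta_A = np.array([1.5, 2.0])

lhs = irrepresentable_norm(Sigma, A, theta_A)
bound = sufficient_bound(Sigma, A)
print(lhs, bound)
```

In this example both quantities equal 0.6/1.3, which is below 1, so A7 holds
with any $\delta$ between that value and 1. In general the first quantity never
exceeds the second, which is why the second gives a sufficient condition.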
4.8 LASSO solution

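We use the symbol $\mu$ to denote the penalty parameter in LASSO. The LASSO
estimate of $\theta$, after preconditioning, is given by

$$\hat\theta^{\mu} = \arg\min_{\zeta} \frac{1}{n} \| \hat Y - X\zeta \|_2^2 + \mu \| \zeta \|_1. \qquad (19)$$

We also define the selected LASSO estimate of $\theta$ by

$$\hat\theta^{\hat B, \mu} = \arg\min_{\zeta : \zeta_{\hat B^c} = 0} \frac{1}{n} \| \hat Y - X\zeta \|_2^2 + \mu \| \zeta \|_1. \qquad (20)$$

For future use, we define the restricted LASSO estimate of $\theta$ to be

$$\hat\theta^{A, \mu} = \arg\min_{\zeta : \zeta_{A^c} = 0} \frac{1}{n} \| \hat Y - X\zeta \|_2^2 + \mu \| \zeta \|_1. \qquad (21)$$

The notations used here follow Meinshausen & Bühlmann (2006).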
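
To make the estimators in (19)-(21) concrete, the following is a minimal
NumPy sketch of the two-step idea: form a pre-conditioned response from the
leading principal components of a thresholded set of predictors, then solve
the LASSO problem restricted to that set. It is a sketch under stated
assumptions, not the authors' implementation: the score used for
thresholding, the threshold `tau`, the ISTA solver, the omission of an
intercept and the simulated data are all choices made for this example.

```python
import numpy as np

def spc_precondition(X, y, tau, K):
    """Threshold univariate scores, then project y on the top K principal components."""
    n = X.shape[0]
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    # Sample analogue of |Sigma_jy| / sqrt(Sigma_jj) (assumed form of the score).
    scores = np.abs(Xc.T @ yc) / (np.sqrt(n) * np.linalg.norm(Xc, axis=0))
    B_hat = np.flatnonzero(scores > tau)
    U, _, _ = np.linalg.svd(Xc[:, B_hat], full_matrices=False)
    UK = U[:, :K]
    y_hat = y.mean() + UK @ (UK.T @ yc)  # pre-conditioned response
    return B_hat, y_hat

def objective(X, y, zeta, mu):
    """(1/n) ||y - X zeta||_2^2 + mu ||zeta||_1, as in (19)."""
    return np.sum((y - X @ zeta) ** 2) / X.shape[0] + mu * np.sum(np.abs(zeta))

def lasso_ista(X, y, mu, cols, n_iter=500):
    """Proximal gradient (ISTA) for the objective above with zeta = 0 outside `cols`."""
    n, p = X.shape
    Xs = X[:, cols]
    L = 2.0 * np.linalg.norm(Xs, 2) ** 2 / n  # Lipschitz constant of the gradient
    z = np.zeros(Xs.shape[1])
    for _ in range(n_iter):
        grad = -2.0 / n * Xs.T @ (y - Xs @ z)
        step = z - grad / L
        z = np.sign(step) * np.maximum(np.abs(step) - mu / L, 0.0)  # soft threshold
    zeta = np.zeros(p)
    zeta[cols] = z
    return zeta

# Hypothetical data: one latent factor V drives the first five predictors and y.
rng = np.random.default_rng(0)
n, p = 200, 50
V = rng.standard_normal(n)
X = rng.standard_normal((n, p))
X[:, :5] += 2.0 * V[:, None]
y = 3.0 * V + rng.standard_normal(n)

B_hat, y_hat = spc_precondition(X, y, tau=1.0, K=1)
theta_hat = lasso_ista(X, y_hat, mu=0.1, cols=B_hat)  # analogue of (20)
print(B_hat, np.flatnonzero(theta_hat))
```

Passing `cols=np.arange(p)` gives the analogue of the unrestricted problem
(19), and passing the true support gives the analogue of (21).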

4.9 Consistency of variable selection 

We shall prove most of our consistency results for the estimate $\hat\theta^{\hat B, \mu}$ and
indicate how (and under what conditions) the same may be proved for the
unrestricted estimator $\hat\theta^{\mu}$. As we shall see, when the model assumptions
hold the former estimator is more reliable under a wider range of possible
dimensions. The latter can consistently select the model essentially when
$p_n = O(n^{\kappa})$ for some $\kappa < \infty$. In order to prove these results, it will be
convenient for us to assume that we have two independent subsamples of
size n each, so that the total sample size is 2n. We also assume that
Step 1 of the variable selection algorithm (estimating B) is performed on the
first subsample and the other steps are performed on the second subsample.
This extra assumption simplifies our proofs somewhat (see the proof of
Proposition 4 in the Appendix). Further, we shall assume that K, the number
of latent components for response Y, is known. The results presented here
hold uniformly w.r.t. the parameters satisfying assumptions A1-A8.

Let $\hat A_{\hat B}$ (resp. $\hat A_{\mu}$) denote the set of nonzero coordinates of the vector
$\hat\theta^{\hat B, \mu}$ (resp. $\hat\theta^{\mu}$). Whenever the context is clear, we shall drop the subscripts
from $\hat A$. In the following $\zeta$ will be used to denote a generic value of the
parameter.

Proposition 1: Let $\hat B$ denote the set of coordinates selected by the
preliminary thresholding scheme of SPC with threshold $\tau_n$. Given any $c_1 > 1$,
there is an $n_{c_1}$, and there is a $\tau_n(c_1) := d_1 \sqrt{\log p_n / n}$, for some
constant $d_1 > 2$, such that, for $n \geq n_{c_1}$,

$$P(\hat B = B) \geq 1 - n^{-c_1}. \qquad (22)$$

Proposition 1 tells us that we can restrict our analysis to the set B while
analyzing the effect of preconditioning, and studying the estimator $\hat\theta^{\hat B, \mu}$. Our
next result is about the behavior of the estimated eigenvalues and eigenvectors
of the matrix $S_{\hat B \hat B} := \frac{1}{n} X_{\hat B}^T X_{\hat B}$. This result can be proved along the lines
of Theorem 3.2 in Paul (2005) (see also Bair et al. (2006)) and is omitted.

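Proposition 2: Let $(\hat u_{\hat B k}, \hat\ell_k)_{k=1}^K$ denote the first K eigenvector-eigenvalue
pairs of $S_{\hat B \hat B}$. Suppose that assumptions A1-A5 hold. Then there are
functions $\gamma_i = \gamma_i(\lambda_1/\sigma_0, \ldots, \lambda_M/\sigma_0)$, $i = 1, 2$, such that, given $c_2 > 0$, there exist
$d_2, d_2' \geq 1$ so that

$$P\left( \max_{1 \leq k \leq K} \| \hat u_{\hat B k} - u_{B k} \|_2 > d_2 \sigma_0 \gamma_1 \sqrt{\frac{\bar q_n \vee \log n}{n}} \left( 1 + \sqrt{\frac{\bar q_n \log n}{n}} \right), \ \hat B = B \right) = O(n^{-c_2}),$$

$$P\left( \max_{1 \leq k \leq K} | \hat\ell_k - \ell_k | > d_2' \sigma_0^2 \gamma_2 \left( \sqrt{\frac{\log n}{n}} + \frac{\bar q_n \log n}{n} \right), \ \hat B = B \right) = O(n^{-c_2}).$$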

Theorem 1: Suppose that assumptions A1-A8 hold. If $\mu = \mu_n$ satisfies
$\mu_n = o(n^{-\kappa_2})$ and $\mu_n n^{\frac{1}{2}(1 - \kappa_0 \vee \kappa_1)} \to \infty$ as $n \to \infty$, then there exists some
$c > 1$ such that, for large enough n,

$$P(\hat A \subset A) \geq 1 - O(n^{-c}), \qquad (23)$$

where $\hat A = \hat A_{\hat B}$. If moreover, $p_n$ is such that $\frac{\bar q_n \log p_n}{n} = o(1)$ as $n \to \infty$,
then (23) holds with $\hat A = \hat A_{\mu}$.

Theorem 2: With $\mu = \mu_n$ and $\hat A$ as in Theorem 1, there exists $c > 1$ such
that

$$P(A \subset \hat A) \geq 1 - O(n^{-c}). \qquad (24)$$

Clearly, Theorem 1 and Theorem 2 together imply that the SPC/LASSO
procedure asymptotically selects the correct set of predictors under the stated
assumptions. The proofs of these critically rely on the following three results.

Lemma 1: Given $\theta \in \mathbb{R}^p$, let $G(\theta)$ be the vector whose components are
defined by

$$G_j(\theta) = -\frac{2}{n} \langle \hat Y - X\theta, X_j \rangle. \qquad (25)$$

A vector $\hat\theta$ with $\hat\theta_j = 0$ for all $j \in A^c$ is a solution of (21) if and only if, for
all $j \in A$,

$$G_j(\hat\theta) = -\operatorname{sign}(\hat\theta_j)\, \mu \ \text{ if } \hat\theta_j \neq 0, \qquad |G_j(\hat\theta)| \leq \mu \ \text{ if } \hat\theta_j = 0. \qquad (26)$$

Moreover, if the solution is not unique and $|G_j(\hat\theta)| < \mu$ for some solution $\hat\theta$,
then $\hat\theta_j = 0$ for all solutions of (21).
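
The optimality conditions of Lemma 1 can be verified numerically in a case
where the LASSO solution is available in closed form. The sketch below
assumes an orthogonal design with $X^T X = nI$, which is not an assumption
of the paper and is used only because the minimizer of
$\frac{1}{n}\|y - X\zeta\|_2^2 + \mu\|\zeta\|_1$ is then coordinate-wise soft thresholding
of $X^T y / n$ at level $\mu/2$. The data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, mu = 50, 8, 0.5

# Orthogonal design scaled so that X^T X = n I (assumption for a closed form).
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q
theta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ theta_true + 0.1 * rng.standard_normal(n)

# Closed-form LASSO solution: soft threshold z = X^T y / n at level mu / 2.
z = X.T @ y / n
theta_hat = np.sign(z) * np.maximum(np.abs(z) - mu / 2, 0.0)

def G(theta):
    """Gradient vector of Lemma 1: G_j = -(2/n) <y - X theta, X_j>."""
    return -2.0 / n * X.T @ (y - X @ theta)

g = G(theta_hat)
active = theta_hat != 0
# Nonzero coordinates: G_j = -sign(theta_j) * mu.
kkt_active = np.allclose(g[active], -np.sign(theta_hat[active]) * mu)
# Zero coordinates: |G_j| <= mu.
kkt_inactive = bool(np.all(np.abs(g[~active]) <= mu + 1e-12))
print(kkt_active, kkt_inactive)
```

Both checks correspond to the two cases of (26), here with all p coordinates
allowed to be nonzero.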

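Proposition 3: Let $\hat\theta^{A, \mu}$ be defined as in (21). Then, under the assumptions
of Theorem 1, for any constant $c_3 > 1$, for large enough n,

$$P\left( \operatorname{sign}(\hat\theta_j^{A, \mu_n}) = \operatorname{sign}(\theta_j), \text{ for all } j \in A \right) \geq 1 - O(n^{-c_3}). \qquad (27)$$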

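Lemma 2: Define

$$\mathcal{E}_{\hat B, \mu} = \left\{ \max_{j \in A^c \cap \hat B} |G_j(\hat\theta^{A, \mu})| < \mu \right\} \cap \{ \hat B = B \}. \qquad (28)$$

On $\mathcal{E}_{\hat B, \mu}$, $\hat\theta^{\hat B, \mu}$ is the unique solution of (20) and $\hat\theta^{A, \mu}$ is the unique solution
of (21), and $\hat\theta^{\hat B, \mu} = \hat\theta^{A, \mu}$. Also, under the assumptions of Theorem 1, there
exists a $c_4 > 1$ such that, for large enough n,

$$P(\mathcal{E}_{\hat B, \mu}^c) = O(n^{-c_4}). \qquad (29)$$

Further, if we define

$$\mathcal{E}_{\mu} = \left\{ \max_{j \in A^c} |G_j(\hat\theta^{A, \mu})| < \mu \right\} \cap \{ \hat B = B \}, \qquad (30)$$

then under the extra assumption that $\frac{\bar q_n \log p_n}{n} = o(1)$, (29) holds with $\mathcal{E}_{\hat B, \mu}$
replaced by $\mathcal{E}_{\mu}$. On $\mathcal{E}_{\mu}$, $\hat\theta^{\mu}$ is the unique solution of (19) and
$\hat\theta^{\mu} = \hat\theta^{\hat B, \mu} = \hat\theta^{A, \mu}$.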

4.10 Effect of projection 

An important consequence of the projection is that the measurement noise
Z is projected onto a K dimensional space (that under our assumptions also
contains the important components of the predictors of Y). This results in
a stable behavior of the residual of the projected response, $\Delta$, given by

$$\Delta := \hat Y - X\theta = \hat Y - X_A \theta_A, \qquad (31)$$

even as dimension $p_n$ becomes large. This can be stated formally in the
following proposition.

Proposition 4: Suppose that assumptions A1-A5 hold. Then there is a
constant $\gamma_3 := \gamma_3(\sigma_0, \lambda_1, \ldots, \lambda_{K+1})$, such that for any $c_6 > 1$ there exists a
constant $d_6 > 0$ so that, for large enough n,

$$P\left( \| \Delta \|_2 \leq d_6 \left( \gamma_3 \sqrt{\bar q_n \vee \log n} + \sigma_1 \sqrt{K \log n} \right) \right) \geq 1 - n^{-c_6}. \qquad (32)$$
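
The role of the projection in Proposition 4 can be illustrated with a small
simulation: the squared norm of standard Gaussian noise in $\mathbb{R}^n$ is of order n,
while the squared norm of its projection onto a K dimensional subspace is of
order K. The sketch below assumes a fixed subspace drawn independently of
the noise, which is a simplification of the setting in the paper; the sizes
`n`, `K` and `reps` are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, reps = 400, 3, 200

# Orthonormal basis of a K dimensional subspace, independent of the noise.
U, _ = np.linalg.qr(rng.standard_normal((n, K)))

# Each column of Z is one draw of the measurement noise.
Z = rng.standard_normal((n, reps))

raw = np.mean(np.sum(Z ** 2, axis=0))           # average ||Z||^2, close to n
proj = np.mean(np.sum((U.T @ Z) ** 2, axis=0))  # average ||P Z||^2, close to K
print(raw, proj)
```

Since the projected noise does not grow with n or $p_n$, only with K, the
noise term in (32) involves $\sqrt{K \log n}$ rather than a quantity depending on
the ambient dimension.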

As a direct corollary to this we have the following result about the risk
behavior of the OLS-estimator (under $L^2$ loss) of the preconditioned data
after we have selected the variables by solving the optimization problem
(20).

Corollary 1: Suppose that conditions of Theorem 1 hold. Then for any
$c_7 > 1$, there is $d_7 > 0$ such that

$$P\left( \| \hat\theta^{\hat A, OLS} - \theta \|_2 \leq d_7 \left( \gamma_3 \sqrt{\frac{\bar q_n \vee \log n}{n}} + \sigma_1 \sqrt{\frac{K \log n}{n}} \right) \right) \geq 1 - n^{-c_7}, \qquad (33)$$

where $\hat\theta_{\hat A}^{\hat A, OLS} = (X_{\hat A}^T X_{\hat A})^{-1} X_{\hat A}^T \hat Y$, and $\hat A = \hat A_{\hat B} = \{ j \in \mathcal{P} : \hat\theta_j^{\hat B, \mu} \neq 0 \}$.

As a comparison we can think of the situation when A is actually known,
and consider the $L^2$ risk behavior of the OLS estimator restricted only to
the subset of variables A. Then $\hat\theta_A^{A, OLS} = (X_A^T X_A)^{-1} X_A^T Y$. Using the fact
that conditional on $X_A$, $\hat\theta_A^{A, OLS}$ has $N(\theta_A, \sigma_\varepsilon^2 (X_A^T X_A)^{-1})$ distribution, and
the fact that the smallest eigenvalue of $\Sigma_{AA}^{-1}$ is at least $\ell_1^{-1}$, it follows (using
Lemma A.1) that there is a constant $d_7' > 0$ such that

$$P\left( \| \hat\theta^{A, OLS} - \theta \|_2 \geq d_7' \ell_1^{-1/2} \sigma_\varepsilon \sqrt{\frac{q_n}{n}} \right) \geq 1 - n^{-c_7}. \qquad (34)$$

Comparing with (33), we see that if $q_n \gg \log n$ and $\sigma_1^2 \gg \bar q_n / q_n$, the
estimator $\hat\theta^{\hat A, OLS}$ has better risk performance than $\hat\theta^{A, OLS}$.

As a remark, we point out that the bound in (33) can be improved under
specific circumstances (e.g. when $\delta_n$, the "selection bias" term defined in
A5, is of a smaller order) by carrying out a second order analysis of the
eigenvectors $\{\hat u_k\}_{k=1}^K$ (see Appendix of Bair et al. (2006)). The same holds
for the bounds on the partial correlations $\frac{1}{n} \langle (I - P_{X_A}) X_j, \hat Y \rangle$, for $j \in A^c$,
given the "signal" variables $\{X_l : l \in A\}$, that are needed in the proof
of Proposition 3 and Lemma 2. However, the result is given here just to
emphasize the point that preconditioning stabilizes the fluctuation in $\hat Y - X\theta$,
and so, partly to keep the exposition brief, we do not present the somewhat
tedious and technical work needed to carry out such an analysis.

As a further comparison, we consider the contribution of the measurement
noise Z in the maximal empirical partial correlation
$\max_{j \in A^c} | \frac{1}{n} \langle (I - P_{X_A}) X_j, \hat Y \rangle |$, given $\{X_l : l \in A\}$. For the pre-conditioned response this
contribution is (with probability at least $1 - O(n^{-c})$ for some $c > 1$) of the
order $O(\sigma_1 \sqrt{\log n / n})$, instead of $O(\sigma_1 \sqrt{\log p_n / n})$ as would be the case if one uses Y
instead of $\hat Y$. So, if $\log p_n \gg \log n$, then the contribution is smaller for the
pre-conditioned response. Formalizing this argument, we derive the following
asymptotic result about the model selection property of the LASSO estimator
that clearly indicates that under the latter circumstances the SPC + LASSO
procedure can outperform conventional LASSO in terms of variable selection.

Proposition 5: Suppose that $\log p_n = cn^{\alpha}$ for some $\alpha \in (0, 1)$ and some
$c > 0$. Suppose that $A = A_+ \cup A_-$ with $A_+$ and $A_-$ disjoint and $A_-$ is
nonempty such that $\| \theta_{A_-} \|_2 = o(n^{-(1-\alpha)/2})$. Assume that $M = K$, $B = \mathcal{P}$
(so that for all $j \notin \mathcal{D}$, $X_j$ are i.i.d. $N(0, \sigma_0^2)$), and $\sigma_1$ is fixed. Suppose further
that all the assumptions of Theorem 1 hold, and there is a $\delta_+ \in (0, 1)$ such
that (if $A_+$ is nonempty)

$$\max_{j \in A_+^c} \| \Sigma_{A_+ A_+}^{-1} \Sigma_{A_+ j} \|_1 < \delta_+. \qquad (35)$$

Then, given $c_8 > 1$, for all $\mu_n > 0$, for large enough n,

$$P(\hat A^{LASSO}_{\mu_n} \neq A) \geq 1 - n^{-c_8}, \qquad (36)$$

where $\hat A^{LASSO}_{\mu_n} = \{ j \in \mathcal{P} : \hat\theta_j^{LASSO, \mu_n} \neq 0 \}$, where

$$\hat\theta^{LASSO, \mu_n} = \arg\min_{\zeta} \frac{1}{n} \| Y - X\zeta \|_2^2 + \mu_n \| \zeta \|_1. \qquad (37)$$

Proposition 5 shows that if $\alpha > 1 - 2\kappa_2$, so that $\eta_n = o(n^{-(1-\alpha)/2})$,
and the assumptions of Proposition 5 are satisfied, then the SPC + LASSO
approach (solving the optimization problem (20) or (19)) can identify A with
appropriate choice of penalization parameter $\mu_n$ (as indicated in Theorem 1)
while LASSO cannot, with any choice of the penalty parameter.



5 Classification problems and further topics 

The pre-conditioning idea has potential application in any supervised learn- 
ing problem in which the number of features greatly exceeds the number of 
observations. A key component is the availability of a consistent estimator 
for the construction of the pre-conditioned outcome variable. 

For example, pre-conditioning can be applied to classification problems. 
Conceptually, we separate the problems of a) obtaining a good classifier and
b) selecting a small set of good features for classification. Many classifiers,
such as the support vector machine, are effective at finding a good separator
for the classes. However, they are much less effective in distilling these features
down into a smaller set of uncorrelated features.

Consider a two-class problem, and suppose we have trained a classifier,
yielding estimates $\hat p_i$, the probability of class 2 for observation
$i = 1, 2, \ldots, N$. Then in the second stage, we apply a selection procedure such as
forward stepwise or the LASSO to an appropriate function of $\hat p_i$; the quantity
$\log[\hat p_i / (1 - \hat p_i)]$ is a logical choice.
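
The second stage described above can be sketched as follows: transform the
estimated class probabilities to the log-odds scale and run forward stepwise
selection on the result. This is a minimal NumPy sketch and not the code used
for the experiments in the paper. The clipping constant `eps`, the greedy
residual-sum-of-squares criterion and the stand-in probabilities (generated
from a known linear score rather than from a trained classifier) are
assumptions made for this example.

```python
import numpy as np

def logit(p_hat, eps=1e-6):
    """Log-odds of estimated probabilities, clipped to avoid log(0)."""
    p = np.clip(p_hat, eps, 1 - eps)
    return np.log(p / (1 - p))

def forward_stepwise(X, y, n_steps):
    """Greedy selection: at each step add the column that most reduces the RSS."""
    n, p = X.shape
    selected = []
    for _ in range(n_steps):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            # Least squares fit with an intercept on the candidate column set.
            D = np.column_stack([np.ones(n), X[:, selected + [j]]])
            coef, *_ = np.linalg.lstsq(D, y, rcond=None)
            rss = np.sum((y - D @ coef) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected

# Hypothetical stand-in for classifier output: probabilities that depend
# on predictors 0 and 3 only.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
p_hat = 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 1.5 * X[:, 3])))

z = logit(p_hat)                      # pre-conditioned outcome
selected = forward_stepwise(X, z, n_steps=2)
print(selected)
```

In practice `p_hat` would come from the trained classifier, and the number of
steps would be chosen by cross-validation or a similar criterion.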

We generated data as in the example of Section 3; however, we turned it into
a classification problem by defining the outcome class $g_i$ as 1 if $y_i < 0$ and
2 otherwise. We applied the nearest shrunken centroid (NSC) classifier of
Tibshirani et al. (2001), a method for classifying microarray samples. We
applied forward stepwise regression both to $g_i$ directly (labeled FS), and to
the output $\log(\hat p_i / (1 - \hat p_i))$ of the NSC classifier (labeled NSC/FS).

The results of 10 simulations are shown in Figure 5. We see that NSC/FS
does not improve the test error of FS, but as shown in the bottom left panel,
it does increase the number of "good" predictors that are found. This is a
topic of further study.

Figure 5: Results of applying pre-conditioning in a classification setting. Top
left panel shows the number of test misclassification errors from forward stepwise
regression; in the top right panel we have applied forward stepwise regression to the
pre-conditioned estimates from the nearest shrunken centroid classifier. The proportion
of good predictors selected by each method is shown in the bottom left.
Appendix

A full version of this paper that includes the Appendix is available at

http://www-stat.stanford.edu/~tibs/ftp/

and also in the arXiv archive.